Course: Applied Geo-data Science at University of Bern (Institute of Geography)

Supervisor: Prof. Dr. Benjamin Stocker

Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider

Further information: https://geco-bern.github.io/agds/

Questions about the workflow? Contact the author:

Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)

Matriculation number: 20-100-178

Reference: Report Exercise 4 (Chapter 9)

1 Introduction

1.1 Objectives

This exercise is an introduction to supervised machine learning. Using a realistic problem, the following skills are trained:

  • Implementing the KNN algorithm

  • Discussing the bias-variance trade-off

  • Understanding the role of k

1.2 Theory

Supervised machine learning is a type of machine learning in which a model is trained on labelled data in order to predict new, unseen data as accurately as possible (i.e. with the lowest error on the new data). This approach comes with challenges: for example, we need to quantify how good the predictions are. Therefore, we use only part of the labelled data for training; with the remaining part we quantify the error.

There are many ways to train an algorithm. Here we use polynomials to approximate the target. A polynomial with \(deg = 0\) describes the data with an intercept only, which is usually a very poor choice: the bias is very high because we hit almost no target value exactly, and the model explains only very little of the target's variance. The higher the degree of the polynomial, the more flexibly we can bend, stretch, or compress the function. In the extreme case we find a polynomial that hits every target value in the training data exactly, so the bias on the training data is zero. We might be tempted to say that the best model is the one that describes the training targets best. But this is not true.

Consider the following: if we use the model that approximates the training targets best, we also fit all the errors. Target and predictors come from measurements and therefore always contain errors (always statistical and almost always systematic errors, i.e. measurement noise). Conversely, with a polynomial of \(deg = 0\) we will not get a good prediction: the model describes the training targets so badly that it cannot predict the test targets well either. The RMSE is high on the training data and will be high on the test data as well. We call such a model underfitted; the bias-variance trade-off shifts towards a very high bias.

If instead we use a polynomial with \(deg = n\), we project the measurement errors into the new data. This causes a high error in the predictions because the model is overfitted: the bias-variance trade-off on the training data shifts towards the variance. The goal is to find the model with the best generalisability, i.e. the model with the best bias-variance trade-off. This exercise shows how to find such a model with a KNN algorithm.
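The effect can be sketched with a small toy example (synthetic data, assumed for this sketch only, not part of the exercise data): fitting polynomials of increasing degree and comparing training and test RMSE.

```r
# Toy illustration: higher polynomial degree lowers the training RMSE,
# but beyond some point the test RMSE typically rises again (overfitting).
set.seed(123)
df   <- data.frame(x = seq(0, 10, length.out = 100))
df$y <- sin(df$x) + rnorm(100, sd = 0.3)   # noisy target
idx  <- sample(100, 70)                    # 70/30 split
train <- df[idx, ]
test  <- df[-idx, ]
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

for (deg in c(1, 3, 15)) {
  fit <- lm(y ~ poly(x, deg), data = train)
  cat(sprintf("deg %2d: train RMSE %.3f, test RMSE %.3f\n", deg,
              rmse(train$y, predict(fit)),
              rmse(test$y,  predict(fit, newdata = test))))
}
```

Because the polynomial models are nested, the training RMSE can only decrease with the degree, while the test RMSE eventually grows again.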

2 Methods

We use the KNN (k-nearest neighbours) algorithm and compare the model to a multivariate linear regression model. For KNN we split the data: 70 % for training and 30 % for validation. Because the algorithm is based on distances, we have to standardize the data; with a Box-Cox transformation we additionally bring the data closer to a normal distribution. We use set.seed() for reproducibility.

2.1 R: Evaluation of the data

All analyses are done with the open-source software RStudio (version 2022.12.0+353), which uses the language R to handle the data. For efficient data processing we use the R package “tidyverse” (and others).

3 Programming and data evaluation

3.1 Packages

The following code chunk loads all packages we need. The package “conflicted” is important: it lets us choose which function to use when functions from different packages share the same name but do not do the same thing.

source("../../R/general/packages.R")

3.2 Read the file

First, we download the data and save it in our repository, but we must do this only once. If the file already exists, we can read it directly; that is why we implemented an if-else statement.

name.of.file <- "../../data/re_ml_01/daily_fluxes.csv"

# If no such file exists yet, create it
if (!file.exists(name.of.file)){
  # Access to the data (the URL must be a single unbroken string)
  url <- paste0("https://raw.githubusercontent.com/geco-bern/agds/main/data/",
                "FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv")
  
  # Read in the data directly from the URL
  daily_fluxes <- read.table(url, header = TRUE, sep = ",")
  
  # Write a CSV file into the repository
  write_csv(daily_fluxes, name.of.file)
}
# In either case, read the local file
daily_fluxes.raw <- read_csv(name.of.file)

3.3 Data Overview

3.3.1 Missing values

If we take a closer look at the file, we see that it contains 334 variables. We cannot continue with our usual procedure because there are too many variables. That is why we have to clean our data first.

3.3.2 Data cleaning

We select only the columns we need. Further, we can see that some columns contain -9999 as a value; our quality function replaces that with NA. Then we use ymd() from the lubridate package to rewrite the date in a proper format. Furthermore, we want only columns of good quality: for that we check selected columns against their quality-control columns. If the proportion of good measurements is less than 80 %, we overwrite the value with NA (and do not drop the row!). After that we discard all quality-control columns. Now we know that our data set is of high enough quality to perform analyses with it.
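The cleaning logic can be sketched on toy data with a single variable (a hypothetical mini example; the real implementation lives in function.use.good.quality.only.R and handles all selected columns):

```r
# Hypothetical sketch of the cleaning steps (toy data, one variable).
df <- data.frame(TA_F    = c(3.2, -9999, 5.1),
                 TA_F_QC = c(0.95, 0.90, 0.40))

df[df == -9999] <- NA             # -9999 codes missing values
df$TA_F[df$TA_F_QC < 0.8] <- NA   # < 80 % good measurements -> NA, row kept
df$TA_F_QC <- NULL                # drop the quality-control column
df
```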

# Load the function in the file
source("../../R/re_ml_01/function.use.good.quality.only.R")

# Function call to clean the data
daily_fluxes <- use.good.quality.only(daily_fluxes.raw)

# Load the function
source("../../R/general/function.visualize.na.values.R")

# Function call
visualize.na.values.without.groups(daily_fluxes)

3.4 Implementation of the KNN algorithm

3.4.1 Split the data

Here we split our data into a training subset and a test subset (70 % training, 30 % test). After the split we can fit our models. We use these three variables as predictors: SW_IN_F, VPD_F, and TA_F. For KNN, we use k = 8 as a first try; this choice is arbitrary. The last code chunk in this workflow will show which k is the optimal one (we do not want to spoil it here, which is why you should use k = 8 first). Now we fit a KNN model and an lm model.

# For reproducibility (pseudo-random)
set.seed(123)  
# Split 70 % to 30 % 
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")

# Split the data in a training set
daily_fluxes_train <- rsample::training(split)
# And a test set
daily_fluxes_test <- rsample::testing(split)

# Load the function in the file.
source("../../R/re_ml_01/function.train.and.test.R")

# Function call for KNN with k = 8
mod.knn <- knn.model(daily_fluxes_train, 8)

# Function call for lm
mod.lm <- lm.model(daily_fluxes_train)
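The helper functions are sourced from an external file. As a rough, non-authoritative sketch (assuming the caret and recipes packages as used in the AGDS course, and GPP_NT_VUT_REF as the target variable), knn.model() and lm.model() could look roughly like this:

```r
# Hypothetical sketch of the sourced helpers (not the actual code in
# function.train.and.test.R; target and preprocessing are assumptions).
knn.model <- function(df_train, k) {
  # Preprocessing recipe: Box-Cox transform, then center and scale,
  # because KNN is distance-based (see Methods)
  pp <- recipes::recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
                        data = df_train) |>
    recipes::step_BoxCox(recipes::all_predictors()) |>
    recipes::step_center(recipes::all_numeric(), -recipes::all_outcomes()) |>
    recipes::step_scale(recipes::all_numeric(), -recipes::all_outcomes())

  caret::train(pp, data = tidyr::drop_na(df_train),
               method = "knn",
               trControl = caret::trainControl(method = "none"),
               tuneGrid = data.frame(k = k),
               metric = "RMSE")
}

lm.model <- function(df_train) {
  lm(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
     data = tidyr::drop_na(df_train))
}
```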

3.4.2 Evaluation

After splitting the data, we have to evaluate the models. The first function call is for the lm model; the function needs the model we fitted in the code chunk above. This is a multivariate regression model and therefore we cannot improve the formula.

The second function call is for the KNN model; the function needs the KNN model as input. The KNN model was fitted in the code chunk above with k = 8. If we want a model with a different k, we have to refit the KNN model with the function in the code chunk above.

# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")

# Function call for linear regression model
eval_model(mod = mod.lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (lm)"), c("Davos (lm)"))

# Function call for KNN
eval_model(mod = mod.knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (knn: k=8)"), c("Davos (knn: k=8)"))

3.4.3 Visualization

Here you can see how the KNN model depends on k. If we use k = 1, the training data are fitted perfectly, but the RMSE on the test data is high (the model is overfitted). If we use k = n, the model is underfitted and the RMSE on the test data is also high. For a visualization of RSQ and RMSE as functions of k, please take a look at the last visualization.

# Load the function into the file
source("../../R/re_ml_01/function.different.k.R")

# Define a sequence for k
my.sequence <- c(1, 8, 25, 50, 100)
# Function call
different.k(my.sequence, daily_fluxes_train, daily_fluxes_test, c("Davos"), c("Davos"))

## k-Nearest Neighbors 
## 
## 1910 samples
##    8 predictor
## 
## Recipe steps: BoxCox, center, scale 
## Resampling: None

4 Discussion

In this short discussion we want to highlight the two main aspects of this exercise. First, we need to know how to judge the models: why can one model be considered better than another, and how can we assess the bias-variance trade-off? Second, we need to know how the KNN algorithm works; for this we need to understand the role of the k neighbours.

4.1 Model discussion and the role of the bias-variance trade-off

In this part we discuss the differences between an lm model and a KNN model.

4.1.0.1 Why is the difference between the evaluation on the training and the test set larger for the KNN model than for the linear regression model?

To answer this question we have to take a closer look at the two models:

An lm model uses all predictors to fit a multivariate regression. However, it is a linear regression, i.e. the polynomial has degree 1: in the training data it fits a hyperplane in an N-dimensional predictor space.

\[ \hat{y}=\hat{a}_0 +\sum_{n=1}^{N}\hat{a}_n x_n \]

Obviously, an lm model does not adapt to local patterns: it fits one global linear combination, and the predictors should be linearly independent. We must keep this in mind when applying the model. Because we only fit a regression, RSQ = 1 is not attainable (except in degenerate cases, e.g. when a predictor equals the target), so the lm model will always have a bias. However, precisely because the model is so rigid, it cannot fit the measurement noise in the training data, and RMSE and MAE change only little from the training to the test set.

The situation with KNN is different. KNN is a non-parametric method: it does not fit a global formula at all. To predict at a point \(x\), it averages the target values of the \(k\) nearest neighbours in predictor space:

\[ \hat{y}(x)=\frac{1}{k}\sum_{i \in \mathcal{N}_k(x)} y_i \]

Because this local average can follow the training data closely (the smaller k, the closer), the fit on the training data is very good, but part of what is fitted is measurement noise. On the test data this flexibility does not pay off to the same extent, so the gap between training and test performance is larger for KNN than for the rigid lm model.
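In practice the KNN prediction at a point is a local average over the k nearest training targets. A minimal numeric sketch (toy numbers, one standardized predictor assumed):

```r
# KNN prediction by hand: average the targets of the k nearest
# training points (Euclidean distance in predictor space).
knn_predict <- function(X_train, y_train, x_new, k) {
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))  # distances to x_new
  mean(y_train[order(d)[1:k]])                    # mean of k nearest targets
}

X <- matrix(c(1, 2, 3, 10), ncol = 1)  # one (standardized) predictor
y <- c(1.0, 1.2, 1.4, 5.0)
knn_predict(X, y, x_new = 2.1, k = 2)  # neighbours x = 2, 3 -> (1.2 + 1.4)/2 = 1.3
```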

4.1.0.2 Why does the evaluation on the test set indicate a better model performance of the KNN model than the linear regression model?

Because RMSE and MAE on the test set are lower for the KNN model than for the lm model, and RSQ is higher. The KNN model makes predictions with a lower bias (lower MAE) and explains more of the variance (higher RSQ); it is better than the lm model, which has a higher bias (higher MAE) and explains less variance (lower RSQ). It follows that the KNN model has a better bias-variance trade-off than the lm model. But be careful: RMSE and RSQ of the KNN model depend directly on k.

4.1.0.3 How would you position the KNN and the linear regression model along the spectrum of the bias-variance trade-off?

The KNN model (with small k) sits towards the variance end of the spectrum: the RSQ on the training data is high and drops for the test data, and the RMSE increases from training to test data. This is the signature of a model that tends to overfit.

The lm model sits towards the bias end: its performance is almost the same on training and test data, but it cannot capture non-linear structure in the target and therefore retains a higher bias.

4.1.0.4 Visualize temporal variations of observed and modeled GPP for both models, covering all available dates.

It seems that our models predict better when the target lies between 2005 and 2010. We can also see that the difference between the two models is not very big. But the residuals are large relative to the target, which has a magnitude of about \(10^{-1}\).

# Load the function into the file
source("../../R/re_ml_01/function.time.variation.R")

# Function call
time.variation(mod.knn, mod.lm, daily_fluxes_test)

4.2 Discussing the role of “k” and the generalisability of a model

The “k” in KNN is a parameter that refers to the number of nearest neighbours. Here we use the MAE instead of the RMSE. Although the values differ, the two curves behave in the same way: both MAE and RMSE are U-shaped as functions of k and have their global minima at (approximately) the same k.

Hypothesis:

The lower we choose k, the higher the RSQ on the training data, but the higher the variance on the test data and therefore the higher the MAE.

Explanation: Let k = 1. Then the training data are predicted perfectly, because only one (the nearest) neighbour, i.e. the training point itself, is taken into account. The bias is zero, so RSQ = 1 and the MAE is 0 as well. But the model is totally overfitted. This is the bias-variance trade-off: if the bias on the training data is low, the variance on the test data is in general high.

Let k = N. Then RSQ lies in (0, 1] and the MAE in [0, ∞), but we do not know where exactly. However, the model tends to be underfitted because it takes all neighbours into account: the prediction is simply the mean of all training observations.
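Both limiting cases can be checked directly with a tiny one-dimensional sketch (toy numbers):

```r
# k = 1 reproduces a training point exactly; k = N returns the
# overall mean of the training targets for any query point.
knn1d <- function(x_train, y_train, x_new, k) {
  mean(y_train[order(abs(x_train - x_new))[1:k]])
}
x <- c(1, 2, 3, 4)
y <- c(2.0, 2.5, 3.5, 4.0)

knn1d(x, y, x_new = 2, k = 1)          # -> 2.5 (the training value itself)
knn1d(x, y, x_new = 2, k = length(x))  # -> 3   (mean of all y)
```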

Conclusion: ideally we would want a model with zero bias and zero variance. Since this is impossible, we have to find the model with the best bias-variance trade-off.

4.2.0.1 Showing model generalisability as a function of model complexity and optimal k

The hypothesis is not fully true. RSQ is a quantity that describes the variance, or more precisely, the percentage of the target's variance explained by the model. For the best bias-variance trade-off we want the k where the bias on the test data is lowest (low MAE) while the explained variance stays high. On figure xx you can see that for k = 1 the model is totally overfitted: on the training data the RSQ is 1, from which it follows that the bias must be 0, but on the test data the MAE is at a global maximum. The model is useless for any predictions; the trade-off is on the side with too much variance.

For k = 200 the RSQ is lowest on the training data. The MAE increases for both data sets, but is not at a maximum. The model is not suitable for predictions because the trade-off is on the side with too much bias.

The optimal k is where the trade-off is best. Generalisability means that we want the model with the lowest MAE on the test data (low bias) and the highest RSQ (most explained variance). For that we mark the minimum of the test-set MAE curve (green point, k = 25). Unfortunately, this first scan only shows that the optimum lies between 20 and 30. We therefore make another function call with all k between 20 and 30. Now we see that the optimal k is 25, and we can choose this model.
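Once the test-set MAE per candidate k is available, picking the optimal k is a simple argmin. A sketch with hypothetical MAE values (toy numbers for illustration only):

```r
# Toy MAE curve over k = 20..30 (hypothetical values); the optimal k
# is the one that minimizes the test-set MAE.
k_seq    <- 20:30
mae_test <- c(0.95, 0.94, 0.93, 0.92, 0.915, 0.91,
              0.912, 0.92, 0.93, 0.94, 0.95)
k_seq[which.min(mae_test)]  # -> 25
```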

Important: here we set a seed and split the data only pseudo-randomly. Without set.seed(123), a different k could be optimal.

# Load the function into the file
source("../../R/re_ml_01/function.parameter.extracter.R")

# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))

# Visualize the MAE and RSQ --> optimal k is 25 (between 20:30)
parameter.extracter(my.sequence, daily_fluxes_train, daily_fluxes_test)

# Visualize MAE and RSQ for k = 20:30 --> it is 25
parameter.extracter(c(20:30), daily_fluxes_train, daily_fluxes_test)